Scene Feature


BlockGAN: Learning 3D Object-aware Scene Representations from Unlabelled Images

Thu Nguyen-Phuoc, Christian Richardt, Long Mai, Yong-Liang Yang, Niloy Mitra

Neural Information Processing Systems

The computer graphics pipeline has achieved impressive results in generating high-quality images, while offering users a great level of freedom and controllability over the generated images. This has many applications in creating and editing content for the creative industries, such as films, games, scientific visualisation, and more recently, in generating training data for computer vision tasks.


ReasonGen-R1: CoT for Autoregressive Image Generation Models through SFT and RL

Zhang, Yu, Li, Yunqi, Yang, Yifan, Wang, Rui, Yang, Yuqing, Dai, Qi, Bao, Jianmin, Chen, Dongdong, Luo, Chong, Qiu, Lili

arXiv.org Artificial Intelligence

Although chain-of-thought reasoning and reinforcement learning (RL) have driven breakthroughs in NLP, their integration into generative vision models remains underexplored. We introduce ReasonGen-R1, a two-stage framework that first imbues an autoregressive image generator with explicit text-based "thinking" skills via supervised fine-tuning on a newly generated reasoning dataset of written rationales, and then refines its outputs using Group Relative Policy Optimization (GRPO). To enable the model to reason through text before generating images, we automatically generate and release a corpus of model-crafted rationales paired with visual prompts, enabling controlled planning of object layouts, styles, and scene compositions. Our GRPO algorithm uses reward signals from a pretrained vision-language model to assess overall visual quality, optimizing the policy in each update. Evaluations on GenEval, DPG, and the T2I benchmark demonstrate that ReasonGen-R1 consistently outperforms strong baselines and prior state-of-the-art models. More: aka.ms/reasongen.
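
To make the RL stage concrete, here is a minimal PyTorch sketch of a GRPO-style update, assuming a group of images sampled for the same prompt and one scalar reward per image from the vision-language judge. The function names and the clipped-surrogate details are illustrative assumptions, not the authors' exact implementation.

    import torch

    def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
        """Group-relative advantages: normalize the rewards of all images
        sampled for the same prompt to zero mean and unit variance."""
        return (rewards - rewards.mean()) / (rewards.std() + eps)

    def grpo_loss(logprobs, old_logprobs, advantages, clip_eps=0.2):
        """PPO-style clipped surrogate objective averaged over the group."""
        ratio = torch.exp(logprobs - old_logprobs)
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
        return -torch.min(unclipped, clipped).mean()

Because advantages are computed relative to the group mean, no learned value function is needed; the reward model alone drives each policy update.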


Where Do You Go? Pedestrian Trajectory Prediction using Scene Features

Rezaei, Mohammad Ali, Ayar, Fardin, Javanmardi, Ehsan, Tsukada, Manabu, Javanmardi, Mahdi

arXiv.org Artificial Intelligence

Accurate prediction of pedestrian trajectories is crucial for enhancing the safety of autonomous vehicles and reducing traffic fatalities involving pedestrians. While numerous studies have focused on modeling interactions among pedestrians to forecast their movements, the influence of environmental factors and scene-object placements has been comparatively underexplored. In this paper, we present a novel trajectory prediction model that integrates both pedestrian interactions and environmental context to improve prediction accuracy. Our approach captures spatial and temporal interactions among pedestrians within a sparse graph framework. To account for pedestrian-scene interactions, we employ advanced image enhancement and semantic segmentation techniques to extract detailed scene features. These scene and interaction features are then fused through a cross-attention mechanism, enabling the model to prioritize relevant environmental factors that influence pedestrian movements. Finally, a temporal convolutional network processes the fused features to predict future pedestrian trajectories. Experimental results demonstrate that our method significantly outperforms existing state-of-the-art approaches, achieving ADE and FDE values of 0.252 and 0.372 meters, respectively, underscoring the importance of incorporating both social interactions and environmental context in pedestrian trajectory prediction.
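
As a rough sketch of the cross-attention fusion step described above, pedestrian interaction features can act as queries over segmentation-derived scene features. The dimensions and module below are illustrative assumptions, not the authors' exact architecture.

    import torch
    import torch.nn as nn

    class SceneCrossAttention(nn.Module):
        """Fuse pedestrian interaction features with scene features.
        Pedestrians query the scene, so the attention weights pick out the
        environmental regions most relevant to each pedestrian."""
        def __init__(self, dim: int = 64, heads: int = 4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, ped_feats: torch.Tensor, scene_feats: torch.Tensor):
            # ped_feats:   (batch, num_pedestrians, dim) from the sparse graph
            # scene_feats: (batch, num_patches, dim) from the segmentation maps
            fused, _ = self.attn(ped_feats, scene_feats, scene_feats)
            return ped_feats + fused  # residual keeps the social-interaction cues

The fused sequence would then feed the temporal convolutional network that decodes future positions.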


CAST: Cross-modal Alignment Similarity Test for Vision Language Models

Dagan, Gautier, Loginova, Olga, Batra, Anil

arXiv.org Artificial Intelligence

Vision Language Models (VLMs) are typically evaluated with Visual Question Answering (VQA) tasks which assess a model's understanding of scenes. Good VQA performance is taken as evidence that the model will perform well on a broader range of tasks that require both visual and language inputs. However, scene-aware VQA does not fully capture input biases or assess hallucinations caused by a misalignment between modalities. To address this, we propose a Cross-modal Alignment Similarity Test (CAST) to probe VLMs for self-consistency across modalities. This test involves asking the models to identify similarities between two scenes through text-only, image-only, or both and then assess the truthfulness of the similarities they generate. Since there is no ground-truth to compare against, this evaluation does not focus on objective accuracy but rather on whether VLMs are internally consistent in their outputs. We argue that while not all self-consistent models are capable or accurate, all capable VLMs must be self-consistent.
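
The test procedure can be summarized in pseudocode. `vlm_generate` and `vlm_judge` below are assumed interfaces standing in for whatever model wrapper is used, not a real API.

    def cast_self_consistency(vlm_generate, vlm_judge, scene_a, scene_b):
        """Probe a VLM for cross-modal self-consistency on a scene pair."""
        conditions = {
            "text_only":  (scene_a.caption, scene_b.caption),
            "image_only": (scene_a.image, scene_b.image),
            "multimodal": ((scene_a.caption, scene_a.image),
                           (scene_b.caption, scene_b.image)),
        }
        # Step 1: elicit similarities under each input modality.
        similarities = {
            mode: vlm_generate("List similarities between the two scenes.", inputs)
            for mode, inputs in conditions.items()
        }
        # Step 2: the same model judges the truthfulness of each claim, so the
        # score measures internal consistency, not objective accuracy.
        verdicts = {mode: vlm_judge(claims, scene_a, scene_b)
                    for mode, claims in similarities.items()}
        return similarities, verdicts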


MCDS-VSS: Moving Camera Dynamic Scene Video Semantic Segmentation by Filtering with Self-Supervised Geometry and Motion

Villar-Corrales, Angel, Austermann, Moritz, Behnke, Sven

arXiv.org Artificial Intelligence

Autonomous systems, such as self-driving cars, rely on reliable semantic environment perception for decision making. Despite great advances in video semantic segmentation, existing approaches ignore important inductive biases and lack structured and interpretable internal representations. In this work, we propose MCDS-VSS, a structured filter model that learns in a self-supervised manner to estimate the scene geometry and ego-motion of the camera, while also estimating the motion of external objects. Our model leverages these representations to improve the temporal consistency of semantic segmentation without sacrificing segmentation accuracy. MCDS-VSS follows a prediction-fusion approach in which scene geometry and camera motion are first used to compensate for ego-motion, then residual flow is used to compensate for the motion of dynamic objects, and finally the predicted scene features are fused with the current features to obtain a temporally consistent scene segmentation. Our model parses automotive scenes into multiple decoupled, interpretable representations such as scene geometry, ego-motion, and object motion. Quantitative evaluation shows that MCDS-VSS achieves superior temporal consistency on video sequences while retaining competitive segmentation performance.
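
A minimal sketch of the prediction-fusion step, assuming feature maps and normalized sampling grids in PyTorch; the shapes and the `fuse_gate` module are placeholders for the paper's learned components, not its actual interfaces.

    import torch
    import torch.nn.functional as F

    def predict_and_fuse(prev_feats, cur_feats, ego_grid, residual_grid, fuse_gate):
        # prev_feats, cur_feats: (B, C, H, W) scene features
        # ego_grid, residual_grid: (B, H, W, 2) normalized sampling grids derived
        # from estimated geometry/ego-motion and from residual flow, respectively
        ego_compensated = F.grid_sample(prev_feats, ego_grid, align_corners=False)
        predicted = F.grid_sample(ego_compensated, residual_grid, align_corners=False)
        # A learned gate decides, per location, how much to trust the prediction
        gate = torch.sigmoid(fuse_gate(torch.cat([predicted, cur_feats], dim=1)))
        return gate * predicted + (1.0 - gate) * cur_feats

Warping in two stages mirrors the decomposition in the abstract: ego-motion explains most of the apparent motion, and residual flow accounts only for dynamic objects.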


Measuring and Modeling Physical Intrinsic Motivation

Martinez, Julio, Binder, Felix, Wang, Haoliang, Haber, Nick, Fan, Judith, Yamins, Daniel L. K.

arXiv.org Artificial Intelligence

Humans are interactive agents driven to seek out situations with interesting physical dynamics. Here we formalize the functional form of physical intrinsic motivation. We first collect ratings of how interesting humans find a variety of physics scenarios. We then model human interestingness responses by implementing various hypotheses of intrinsic motivation, ranging from models that rely on simple scene features to models that depend on forward physics prediction. We find that the single best predictor of human responses is adversarial reward, a model derived from physical prediction loss. We also find that simple scene-feature models do not generalize their prediction of human responses across all scenarios. Finally, linearly combining the adversarial model with the number of collisions in a scene leads to the greatest improvement in predictivity of human responses, suggesting humans are driven towards scenarios that result in high information gain and physical activity.
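
The best-performing model amounts to a two-feature linear regression; a toy sketch (with hypothetical input arrays) fits it by least squares.

    import numpy as np

    def fit_interestingness(adv_reward, n_collisions, human_ratings):
        """Fit ratings ~ w0 * adversarial_reward + w1 * collisions + bias."""
        X = np.column_stack([adv_reward, n_collisions, np.ones_like(adv_reward)])
        weights, *_ = np.linalg.lstsq(X, human_ratings, rcond=None)
        return weights  # predict new ratings with X @ weights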